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Abstract 

The  factorization  method,  first  developed  by  Tomasi  and  Kanade,  recovers  both  the  shape  of  an  object 
and  its  motion  from  a  sequence  of  images,  using  many  images  and  tracking  many  feature  points  to 
obtain  highly  redundant  feature  position  information.  The  method  robustly  processes  the  feature  trajec¬ 
tory  information  using  singular  value  decomposition  (SVD),  taking  advantage  of  the  linear  algebraic 
properties  of  orthographic  projection.  However,  an  orthographic  formulation  limits  the  range  of  motions 
the  method  can  accommodate.  Paraperspective  projection,  first  introduced  by  Ohta,  is  a  projection 
model  that  closely  approximates  perspective  projection  by  modelling  several  effects  not  modelled  under 
orthographic  projection,  while  retaining  linear  algebraic  properties.  Our  paraperspective  factorization 
method  can  be  applied  to  a  much  wider  range  of  motion  scenarios,  such  as  image  sequences  containing 
significant  translational  motion  toward  the  camera  or  across  the  image.  The  method  also  can  accommo¬ 
date  missing  or  uncertain  tracking  data,  which  occurs  when  feature  points  are  occluded  or  leave  the  field 
of  view.  We  present  the  results  of  several  experiments  which  illustrate  the  method’s  performance  in  a 
wide  range  of  situations,  including  an  aerial  image  sequence  of  terrain  taken  from  a  low-altitude  air¬ 
plane. 
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1.  Introduction 


Recovering  the  geometry  of  a  scene  and  the  motion  of  the  camera  from  a  stream  of  images  is 
an  important  task  in  a  variety  of  applications,  including  navigation,  robotic  manipulation, 
and  aerial  cartography.  While  this  is  possible  in  principle,  traditional  methods  have  failed  to 
produce  reliable  results  in  many  situations  [2]. 

Tomasi  and  Kanade  [10][11]  developed  a  robust  and  efficient  method  for  accurately  recov¬ 
ering  the  shape  and  motion  of  an  object  from  a  sequence  of  images,  called  the  factorization 
method.  It  achieves  its  accuracy  and  robustness  by  applying  a  well-understood  numerical 
computation,  the  singular  value  decomposition  (SVD),  to  a  large  number  of  images  and  fea¬ 
ture  points,  and  by  directly  computing  shape  without  computing  the  depth  as  an  intermedi¬ 
ate  step.  The  method  was  tested  on  a  variety  of  real  and  synthetic  images,  and  was  shown  to 
prerform  well  even  for  distant  objects. 

The  Tomasi-Kanade  factorization  method,  however,  assumed  an  orthographic  projection 
model,  since  it  can  be  described  by  linear  equations.  The  applicability  of  the  method  is 
therefore  limited  to  image  sequences  created  from  certain  types  of  camera  motions.  The 
orthographic  model  contains  no  notion  of  the  distance  from  the  camera  to  the  object.  As  a 
result,  shape  reconstruction  from  image  sequences  containing  large  translations  toward  or 
away  from  the  camera  often  produces  deformed  object  shapes,  as  the  method  tries  to  explain 
the  size  differences  in  the  images  by  creating  size  differences  in  the  object.  The  method  also 
supplies  no  estimation  of  translation  along  the  camera’s  optical  axis,  which  limits  its  useful¬ 
ness  for  certain  tasks. 


There  exist  several  perspective  approximations  which  capture  more  of  the  effects  of  per¬ 
spective  projection  while  remaining  linear.  Scaled  orthographic  projection,  sometimes 
referred  to  as  “weak  perspective”  [4],  accounts  for  the  scaling  effect  of  an  object  as  it  moves 
towards  and  away  from  the  camera.  Paraperspective  projection,  first  introduced  by  Ohta  [5] 
and  named  by  Aloimonos  [1],  models  the  position  effect  (an  object  is  viewed  from  different 
angles  as  it  translates  across  the  field  of  view)  as  well  as  the  scaling  effect. 


In  this  paper,  we  present  a  factorization  method  based  on  the  paraperspective  projection 
model.  The  paraperspective  factorization  method  is  still  fast,  and  robust  with  resp)ect  to 
noise.  It  can  be  applied  to  a  wider  realm  of  situations  than  the  original  factorization  method, 
such  as  sequences  containing  significant  depth  translation  or  containing  objects  close  to  the 
camera,  and  can  be  used  in  applications  where  it  is  important  to  recover  the  distance  to  the 
object  in  each  image,  such  as  navigation. 


We  begin  by  describing  our  camera  and  world  reference  frames  and  introduce  the  mathe-  ^ 
matical  notation  that  we  use.  We  review  the  original  factorization  method  as  defined  in  [11], 
presenting  it  in  a  slightly  different  manner  in  order  to  make  its  relation  to  the  paraperspec-  ^4 
tive  method  more  apparent.  We  then  present  our  paraperspective  factorization  method,  fol¬ 
lowed  by  an  extension  which  accommodates  occlusions.  We  conclude  with  the  results  of 
several  experiments  which  demonstrate  the  practicality  of  our  system. 
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2.  Problem  Description 

In  a  shape-from-motion  problem,  we  are  given  a  sequence  of  F  images  taken  from  a  camera 
that  is  moving  relative  to  an  object.  Assume  for  the  time  being  that  we  locate  P  prominent 
feature  points  in  the  first  image,  and  track  these  points  from  each  image  to  the  next,  record¬ 
ing  the  coordinates  of  each  point  p  in  each  image  /.  Each  feature  point  p  that  we 

track  corresponds  to  a  single  world  point,  located  at  position  in  some  fixed  world  coordi¬ 
nate  system.  Each  image  /  was  taken  at  some  camera  orientation,  which  we  describe  by  the 
orthonormal  unit  vectors  i^,  j^,  and  k^,  where  and  correspond  to  the  x  and  y  axes  of  the 
camera’s  image  plane,  and  points  along  the  camera’s  line  of  sight.  We  describe  the  posi¬ 
tion  of  the  camera  in  each  frame  /  by  the  vector  indicating  the  camera’s  focal  point.  This 
formulation  is  illustrated  in  Figure  1. 


The  result  of  the  feature  tracker  is  a  set  of  P  feature  point  coordinates  for  each  of 

the  F  frames  of  the  image  sequence.  From  this  information,  our  goal  is  to  estimate  the  shape 
of  the  object  as  for  each  object  point,  and  the  motion  of  the  camera  as  1/,  j^,  k/  and  y  for 
each  frame  in  the  sequence. 


In  Section  6  we  will  relax  the  requirement  that  every  feature  point  be  visible  in  every  frame, 
allowing  the  inclusion  of  feature  points  that  are  observed  and  tracked  through  only  a  portion 
of  the  sequence. 


3.  The  Orthographic  Factorization  Method 


pages 


This  section  presents  a  summary  of  the  orthographic  factorization  method  developed  by 
Tomasi  and  Kanade.  A  more  detailed  description  of  the  method  can  be  found  in  [1 1]. 

3.1.  Orthographic  Projection 

•  The  orthographic  projection  model  assumes  that  rays  are  projected  from  an  object  point 

along  the  direction  parallel  to  the  camera’s  optical  axis,  so  that  they  strike  the  image  plane 
orthogonally,  as  illustrated  in  Figure  2.  A  point  p  whose  location  is  will  be  observed  in 


Orthographic  projection  in  two  dimensions 

Dotted  lines  indicate  true  perspective  projection 


frame  /  at  image  coordinates  .  where 


“/p  =  ■/•  ("p-V 

These  equations  can  be  rewritten  as 

7p  =  j/(*p-V> 

(1) 

“/P  =  "’/■®P'"^/ 

where 

7p  " 

(2) 

• 

Xy=  -(V  y) 

>y=-(V  V 

(3) 

• 

II 

(4) 

3.2.  Decomposition 


All  of  the  feature  point  coordinates  (“yp-v^p)  are  entered  in  a  2FxP  measurement  matrix  w. 
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w  = 


“ll 

...  U^p 

“pi 

...  Ufrp 

‘'ll 

...  V,p 

Vi 

< 

■*)  • 
I'O 

(5) 


Each  column  of  the  measurement  matrix  contains  the  observations  for  a  single  point,  while 
each  row  contains  the  observed  u-coordinates  or  v-coordinates  for  a  single  frame.  Equation 
(2)  for  all  points  and  frames  can  now  be  combined  into  the  single  matrix  equation 


W  =  A/S+r[i  ...  i]  .  (6) 

where  M  is  the  2Fx3  motion  matrix  whose  rows  are  the  and  vectors,  5  is  the  3  x  P 
shape  matrix  whose  columns  are  the  vectors,  and  r  is  the  2Fx  i  translation  vector  whose 
elements  are  the  and  yy. 

Up  to  this  point,  Tomasi  and  Kanade  placed  no  restrictions  on  the  location  of  the  world  ori¬ 
gin,  except  that  it  be  stationary  with  respect  to  the  object.  Without  loss  of  generality,  they 
position  the  world  origin  at  the  center  of  mass  of  the  object,  denoted  by  c,  so  that 


c 


?  2  'r  =  »  • 

P-  I 


(7) 


Because  the  sum  of  any  row  of  S  is  zero,  the  sum  of  any  row  /  of  w  is  P7, .  This  enables 
them  to  compute  the  «'*  element  of  the  translation  vector  T  directly  from  w,  simply  by  aver¬ 
aging  the  i'*  row  of  the  measurement  matrix.  The  translation  is  the  subtracted  from  w,  leav¬ 
ing  a  “registered”  measurement  matrix  w*  =  W'-r[i  ...  i]  .Because  w*  is  the  product  of  a 
2Px  3  motion  matrix  M  and  a  3  x  p  shape  matrix  S,  its  rank  is  at  most  3.  When  noise  is 
present  in  the  input,  the  w*  will  not  be  exactly  of  rank  3,  so  the  Tomasi-Kanade  factoriza¬ 
tion  method  uses  the  SVD  to  find  the  best  rank  3  approximation  to  W* ,  factoring  it  into  the 
product 


W'*  =  ws . 


(8) 


3.3.  Normalization 

The  decomposition  of  equation  (8)  is  only  determined  up  to  a  linear  transformation.  Any 
non-singular  3x3  matrix  A  and  its  inverse  could  be  inserted  between  M  and  S,  and  their 
product  would  still  equal  w’ .  Thus  the  actual  motion  and  shape  are  given  by 

M  =  MA  5  =  A”'s  ,  (9) 

with  the  appropriate  3x3  invertible  matrix  A  selected.  The  correct  a  can  be  determined 
using  the  fact  that  the  rows  of  the  motion  matrix  M  (which  are  the  and  vectors)  repre- 


page? 

sent  the  camera  axes,  and  therefore  they  must  be  of  a  certain  form.  Since  and  are  unit 
vectors,  we  see  from  equation  (4)  that 

|m/  =1  !«»/=».  (10) 

and  because  they  are  orthogonal, 

=  0  .  (11) 

Equations  (10)  emd  (11)  give  us  3F  equations  which  we  call  the  metric  constraints.  Using 
these  constraints,  we  solve  for  the  3x3  matrix  a  which,  when  multiplied  by  M,  produces 
the  motion  matrix  M  that  best  satisfies  these  constraints.  Once  the  matrix  A  has  been  found, 
the  shape  and  motion  are  computed  from  equation  (9). 
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4.  The  Paraperspective  Factorization  Method 

The  Tomasi-Kanade  factorization  method  was  shown  to  be  computationally  inexpensive 
and  highly  accurate,  but  its  use  of  an  orthographic  projection  assumption  limited  the  meth¬ 
od’s  applicability.  For  example,  the  method  does  not  produce  accurate  results  when  there  is 
significant  translation  along  the  camera’s  optical  axis,  because  orthography  does  not  account 
for  the  fact  that  an  object  appears  larger  when  it  is  closer  to  the  camera.  We  must  model  this 
and  other  perspective  effects  in  order  to  successfully  recover  shape  and  motion  in  a  wider 
range  of  situations.  We  choose  an  approximation  to  perspective  projection  known  as  parap¬ 
erspective  projection,  which  was  introduced  by  Ohta  [5]  in  order  to  solve  a  shape  from  tex¬ 
ture  problem.  Although  the  paraperspective  projection  equations  are  more  complex  than 
those  for  orthography,  their  basic  form  is  the  same,  enabling  us  develop  a  method  analogous 
to  that  developed  by  Tomasi  and  Kanade. 


4.1.  Paraperspective  Projection 


Paraperspective  projection  closely  approximates  perspective  projection  by  modelling  both 
the  scaling  effect  (closer  objects  appear  larger  than  distant  ones)  and  the  position  effect 
(objects  in  the  periphery  of  the  image  are  viewed  from  a  different  angle  than  those  near  the 
center  of  projection  [1]),  while  retaining  the  linear  properties  of  orthographic  projection. 
The  paraperspective  projection  of  an  object  onto  an  image,  illustrated  in  Figure  3,  involves 
two  steps. 

1 .  An  object  point  is  projected  along  the  direction  of  the  line  connecting  the  focal  point 
of  the  camera  to  the  object’s  center  of  mass,  onto  a  hypothetical  image  plane  parallel  to 
the  real  image  plane  and  passing  through  the  object’s  center  of  mass. 

2.  The  point  is  then  projected  onto  the  real  image  plane  using  perspective  projection. 
Because  the  hypothetical  plane  is  parallel  to  the  real  image  plane,  this  is  equivalent  to 
simply  scaling  the  point  coordinates  by  the  ratio  of  the  camera  focal  length  and  the  dis¬ 
tance  between  the  two  planes.’ 

In  general,  the  projection  of  a  point  p  along  direction  r,  onto  the  plane  defined  by  its  normal 
n  and  distance  from  the  origin  d,  is  given  by  the  equation 


P' 


=  P- 


p  n 

- - r  . 

r  n 


(12) 


In  frame  /,  each  object  point  is  projected  along  the  direction  c  - (which  is  the  direction 
from  the  camera’s  focal  point  to  the  object’s  center  of  mass)  onto  the  plane  defined  by  nor¬ 
mal  and  distance  from  the  origin  c  k^.  The  result  of  this  projection  is 


1 .  The  scaled  orthographic  projection  model  (also  known  as  "weak  perspective”)  is  similar  to  paraperspective  projection, 
except  that  the  direction  of  the  initial  projection  in  step  I  is  parallel  to  the  camera's  optical  axis  rather  than  parallel  to  the 
line  connecting  the  object’s  center  of  mass  to  the  camera's  focal  point.  This  model  captures  the  scaling  effect  of  perspective 
projection,  but  not  the  position  effect.  (See  Appendix  I.) 
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Hypothetical 


Paraperspective  projection  in  two  dimensions 

Dotted  lines  indicate  true  perspective  projection 
indicate  parallel  lines. 


fP 


(s  ■  k,)  -  (c  ■  k,) 


(13) 


The  perspective  projection  of  this  point  onto  the  image  plane  is  given  by  subtracting  from 
s'jr^  to  give  the  position  of  the  point  in  the  camera’s  coordinate  system,  and  then  scaling  the 
result  by  the  ratio  of  the  camera’s  focal  length  /  to  the  depth  to  the  object’s  center  of  mass 
This  yields  the  coordinates  of  the  projection  in  the  image  plane. 


“/p  =  V)  7p  =  7  =  ('-V  • 

Substituting  (13)  into  (14)  and  simplifying  gives  the  general  paraperspective  equations  for 

7p  7p 


“//» =  r  { 
i 


■/- 


'/• 


il^kl 


if  (c-M 


(Sp-c)  +  (c-y  y) 
(Sp-c)  +  (c-t^)  •  j^} 


For  simplicity,  we  assume  unit  focal  length,  /  =  1 . 


(15) 


Without  loss  of  generality  we  can  simplify  our  equations  by  placing  the  world  origin  at  the 
object’s  center  of  mass  so  that  by  definition 


=  4  X  =  0 

p  ^  p 


(16) 


p=  I 
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This  reduces  (15)  to 


«/p  =  7{ 


if+ 


■f 
if  h 


These  equations  can  be  rewritten  as 


where 


Y 


*p-<y  y>} 

(17) 

•Sp-(VV} 

II 

+ 

(18) 

(19) 

(20) 

II 

(21) 

Notice  that  equation  (18)  has  a  form  identical  to  its  counterpart  for  orthographic  projection, 
equation  (2),  although  the  corresponding  definitions  of  >y,  and  differ.  This  enables 
us  to  perform  the  basic  decomposition  of  the  matrix  in  the  same  manner  that  Tomasi  and 
Kanade  did  for  the  orthographic  case. 


4.2.  Paraperspective  Decomposition 


We  can  combine  equation  (18),  for  all  points  p  from  1  to  P,  and  all  frames  /  from  1  to  F, 
into  the  single  matrix  equation 


r  n 

- 

*  - 

«,i  . 

•  “ip 

m, 

“n  • 

‘'ll  • 

•  “pp 

•  ‘'ip 

Til^ 

"l 

[s,  ...  Sp]  + 

^p 

>1 

‘'pp 

."f 

(22) 


or  in  short 


w  =  A/s  +  r[i  ...  i]  ,  (23) 

where  w  is  the  2Fx  P  measurement  matrix,  M  is  the  2Fx  3  motion  matrix,  S  is  the  3  x  p 
shape  matrix,  and  T  is  the  IF  x  i  translation  vector. 
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Using  equations  (16)  and  (18)  we  can  write 


p  p 

I  “/p=  I  (">/  Sp  +  V  = 

P  =  I  p  =  1 

p  p 


p 

“/  I  + 

p  =  1 

p 


(24) 


I  ‘'/p=  I  ("/  Sp''>/>  ="/  I  s-^^yr^yf 

p  =  1  p = 1  p = I 

Therefore  we  can  compute  Xj  and  >y,  which  are  the  elements  of  the  translation  vector  T, 
immediately  from  the  image  data  as 


1  1 

^f=  pi  %  y/  =  pi  ^fp  ■  (25) 

p = 1  p  =  I 

Once  we  know  the  translation  vector  T,  we  subtract  it  from  w,  giving  the  registered  mea¬ 
surement  matrix 


w*  =  W'-r[|  ...  i]  =  MS  .  (26) 

Since  w*  is  the  product  of  two  matrices  each  of  rank  at  most  3,  tv*  has  rank  at  most  3,  just 
as  it  did  in  the  orthographic  projection  case.  When  noise  is  present,  the  rank  of  tv*  will  not 
be  exactly  3,  but  by  computing  the  SVD  of  w*  and  only  retaining  the  largest  3  singular  val¬ 
ues,  we  can  factor  it  into 


W*  =  MS,  (27) 

where  M  is  a  2Fx  3  matrix  and  S  is  a  3  x  p  matrix.  Using  the  SVD  to  perform  this  factor¬ 
ization  guarantees  that  the  product  MS  is  the  best  possible  rank  3  approximation  to  w’ ,  in 
the  sense  that  it  minimizes  the  sum  of  squares  difference  between  corresponding  elements  of 
w*  and  MS. 


4.3.  Paraperspective  Normalization 

Just  as  in  the  orthographic  case,  the  decomposition  of  tv’  into  the  product  of  M  and  S  by 
equation  (27)  is  only  determined  up  to  a  linear  transformation  matrix  A .  Again,  we  deter¬ 
mine  this  matrix  A  by  observing  that  the  rows  of  the  motion  matrix  M  (the  and  vec¬ 
tors)  must  be  of  a  certain  form.  Taking  advantage  of  the  fact  that  i^,  j^,  and  are  unit 
vectors,  from  equation  (21)  we  observe  that 


I*"/  = 


1  +  Xr 


l"/  = 


(28) 


We  know  the  values  of  and  >y  from  our  initial  registration  step,  but  we  do  not  know  the 
value  of  the  depth  z^.  Thus  we  cannot  impose  individual  constraints  on  the  magnitudes  of 
and  as  was  done  in  the  orthographic  factorization  method.  Instead  we  adopt  the  following 
constraint  on  the  magnitudes  of  and 
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I"*/!'  _  I"/ 


1  +X~f 


i+y; 


(i 


(29) 


In  the  case  of  orthographic  projection,  one  constraint  on  and  n,  was  that  they  each  have 
unit  magnitude,  as  required  by  equation  (10).  In  the  above  paraperspective  case,  we  simply 
require  that  their  magnitudes  be  in  a  certain  ratio. 


There  is  also  a  constraint  on  the  angle  relationship  of  and  n^.  From  equation  (21),  and  the 
knowledge  that  j^,  and  ky  are  orthogonal  unit  vectors. 


m 


y  *■/ 


ly  JTyky  Jy "  .Vyky  Y  y 

n,  -  -  -  =  — — 


(30) 


The  problem  with  this  constraint  is  that,  again,  Zy  is  unknown.  We  could  use  either  of  the 
two  values  given  in  equation  (29)  for  \/zj ,  but  in  the  presence  of  noisy  input  data  the  two 
will  not  be  exactly  equal,  so  we  use  the  average  of  the  two  quantities.  We  choose  the  arith¬ 
metic  mean  over  the  geometric  mean  or  some  other  measure  in  order  to  keep  the  solution  of 
these  constraints  linear.  Thus  our  second  constraint  becomes 


I 

my  Hy  =  Y/2 


1"*/  ,  I"/ ) 

l+JCy  l+>y/ 


(31) 


This  is  the  paraperspective  version  of  the  orthographic  constraint  given  by  equation  (11), 
which  required  that  the  dot  product  of  my  and  ny  be  zero. 


Equations  (29)  and  (31)  are  homogeneous  constraints,  which  could  be  trivially  satisfied  by 
the  solution  V/  my  =  ny  =  0  ,  or  Af  =  0 .  To  avoid  this  solution,  we  impose  the  additional 
constraint 


|m,l  =  1  .  (32) 

This  does  not  effect  the  final  solution  except  by  a  scaling  factor. 

Equations  (29),  (31),  and  (32)  gives  us  2F+  i  equations,  which  are  the  paraperspective  ver¬ 
sion  of  the  metric  constraints.  We  compute  the  3x3  matrix  A  such  that  M  =  MA  best  satis¬ 
fies  these  metric  constraints  in  the  least  sum-of-squares  error  sense.  This  is  a  simple  problem 
because  the  constraints  are  linear  in  the  6  unique  elements  of  the  symmetric  3x3  matrix 
Q  =  A^A  .  We  use  the  metric  constraints  to  compute  Q,  compute  its  Jacobi  Transformation 
Q  =  LAL^  ,  where  A  is  the  diagonal  eigenvalue  matrix,  and  as  long  as  Q  is  positive  definite, 
A  =  (La'-^)  . 

4.4.  Paraperspective  Motion  Recovery 

Once  the  matrix  A  has  been  determined,  we  compute  the  shape  matrix  S  =  A“’s  and  the 
motion  matrix  M  =  MA  .  For  each  frame  /,  we  now  need  to  recover  the  camera  orientation 
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vectors  and  k/  from  the  vectors  and  which  are  the  rows  of  the  matrix  M.  From 
equation  (21)  we  see  that 

\f=z^f+x^f  jf=zfi^+y^f.  (33) 

From  this  and  the  knowledge  that  if,  ]f,  and  k/  must  be  orthonormal,  we  determine  that 

V^j/=  (z/n^+JCyk/)  X  {Zfif+y^f)  =  k/ 

|i/|  =  =  1  (34) 

|j/l  =  =  ' 

Again,  we  do  not  know  a  value  for  Zp  but  using  the  relations  specified  in  equation  (29)  and 
the  additional  knowledge  that  |ky|  =  1 ,  equation  (34)  can  be  reduced  to 

G^f=Hp  (35) 

where 


We  compute  kf  simply  as 

kf  =  GJ'Hj  (38) 

and  then  compute 

lf=nj^xkf  jf=kfXmf.  (39) 

There  is  no  guarantee  that  the  if  and  If  given  by  this  equation  will  be  orthonormal,  because 
mj  and  may  not  have  exactly  satisfied  the  metric  constraints.  Therefore  we  actually  use 
the  orthonormals  which  are  closest  to  the  if  and  j/  vectors  given  by  equation  (39).  Due  to  the 
arbitrary  world  coordinate  orientation,  to  obtain  a  unique  solution  we  then  rotate  the  com¬ 
puted  shape  and  motion  to  ali^n  the  world  axes  with  the  first  frame’s  camera  axes,  so  that 
ii  =  [i  0  (3  and  ji  =  [o  1  ^  . 

All  that  remain  to  be  computed  are  the  translations  for  each  frame.  We  calculate  the  depth  Zf 
from  equation  (29).  Since  we  know  Xp  y-p  Zp  if,  j/,  and  fe/,  we  can  calculate  if  using  equa¬ 
tions  (19)  and  (20). 
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4.5.  Solution  Ambiguity  Removal 

In  order  to  solve  for  the  shape  and  motion  at  the  end  of  Section  4.3.,  we  computed  the  matrix 
Q  =  A  that  best  satisfied  the  metric  constraints,  and  then  computed  A  from  Q.  There  is 

[±1  0  o1 

a  sign  ambiguity  in  this  process,  since  any  matrix  of  the  form  M  o  ±1  0  produces  the 

_0  0  ±1 

same  matrix  Q  when  multiplied  by  its  transpose.  Thus  there  are  actua  ly  several  equally 
plausible  motion  and  shape  matrices,  since  changing  the  sign  of  a  column  of  M  and  the  cor¬ 
responding  row  of  S  still  produces  a  solution  that  satisfies  the  ntetric  constraints  equally 
well.  This  sign  ambiguity  in  the  first  two  columns  of  M  was  removed  when  we  aligned  the 
world  coordinate  axes  with  the  first  frame’s  camera  axes,  at  the  end  of  Section  4.4.  How¬ 
ever,  the  ambiguity  in  the  third  column  of  M  and  the  third  row  of  S  is  a  genuine  ambiguity. 
There  are  two  equally  valid  solutions,  whose  shapes  differ  only  by  a  reflection  about  the  z- 


Geometrically,  this  ambiguity  arises  because  paraperspective  projection  does  not  account 
for  any  different  perspective  deformation  within  the  object  itself.  It  is  not  possible  to  distin¬ 
guish  the  “front”  of  the  object  from  the  “back”  of  the  object,  as  can  be  seen  from  Figure 
4(a).  However,  in  real  scenarios  there  will  be  additional  queues  due  to  occlusion  information 
as  in  Figure  4(b)  and  (c),  or  due  to  perspective  distortion  of  the  object,  as  in  Figure  4(d),  if 
the  object  is  not  too  distant  from  the  camera.  Simple  methods  based  on  either  of  these  phe¬ 
nomena  should  be  sufficient  to  determine  which  of  the  two  solutions  given  by  the  paraper¬ 
spective  factorization  method  is  consistent  with  the  image  data. 

4.6.  Paraperspective  Projection  as  an  Approximation  to  Perspective 
Projection 

In  Section  4.I.,  we  defined  paraperspective  projection  geometrically.  We  can  derive  the 
same  equations  mathematically  as  a  first-order  approximation  to  the  perspective  projection 
equations.  The  perspective  projection  of  point  p  onto  the  image  plane  in  frame  /  is  given  by 
^'fp)  >  where 

.  (40) 


For  simplicity  we  assume  unit  focal  length,  /  =  i . 


We  define  the  term 


Zj-  tjkj 


(42) 
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Figure  4  Ambiguity  of  Solution 

(a)  Sequence  of  images  with  two  valid  motion  and  shape  interpretations. 

(b) ,  (c)  Ambiguity  removed  due  to  occlusion  information  in  the  image  sequence, 
(d)  Ambiguity  removed  due  to  perspective  distortion  of  the  object  in  the  images 


and  then  compute  the  Taylor  series  expansion  of  the  above  equations  about  the  point 


yielding 


(43) 


IV- 


(Zfp-Zf)  +... 


^fp  Zr  ,2  9^2 

J  Zf 


_,)2^ 

3  9  ^ 

9 


(44) 


We  combine  equations  (41)  and  (42)  to  determine  that  Zjp-Zf  =  s^,  and  substitute  this 

into  equation  (44)  to  produce 


“/p“ 


'/■  9  '/■  9 


/  Zf  <./ 


,  L  (s  -U 

'(k/  V 


2 


Vp  7,  2  ^"y  v^2  .3  V 


(45) 


Ignoring  all  but  the  first  term  of  the  Taylor  series  yields  the  equations  for  scaled  orthogra¬ 
phic  projection  (See  Appendix  I.)  However,  instead  of  arbitrarily  stopping  at  the  first  term. 


we  eliminate  higher  order  terms  based  on  the  approximation  that  =  0,  which  will  be 

accurate  when  the  size  of  the  object  is  smaller  than  the  distance  of  the  object  from  the  cam¬ 
era.  Eliminating  these  terms  reduces  the  equation  (45)  to 


fP  7.  2  '“V 


^fP~ 


j/ 


(46) 


Y 


—  (k/  V 


Factoring  out  the  l and  expanding  the  dot-products  and 


j/  V 


(47) 


These  equations  are  equivalent  to  the  paraperspective  projection  equations  given  by  equa¬ 
tion  (17). 


The  approximation  that  |s^|^/z^  =  0  preserves  the  portion  of  the  second  term  of  the  Taylor 
series  expansion  of  order  (|Sp|  |tyj )  /z3,  while  ignoring  the  portion  of  the  second  term  of  order 
|Sp|'/z^  and  all  higher  order  terms.  Clearly  if  the  translation  that  the  object  undergoes  is  also 
small,  then  there  is  little  justification  for  preserving  this  portion  of  the  second  term  and  not 
the  other.  In  such  cases,  the  entire  second  term  can  be  safely  ignored,  leaving  only  the  equa¬ 
tions  for  scaled  orthographic  projection. 


Note  that  we  did  not  explicitly  set  the  world  origin  at  the  object’s  center  of  mass,  as  we  did 
in  Section  4.1.  However,  the  assumption  that  |Spj^/z^  =  0  will  be  most  accurate  when  the 
magnitudes  of  the  are  smallest.  Since  the  vectors  represent  the  vectors  from  the  world 
origin  to  the  object  points,  their  magnitudes  will  be  smaller  when  the  world  origin  is  located 
near  the  object’s  center  of  mass. 


{ 
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5.  Comparison  of  Methods  using  Synthetic  Data 

In  this  section  we  compare  the  performance  of  our  new  paraperspective  factorization 
method  with  the  previous  orthographic  factorization  method.  The  comparison  also  includes 
a  factorization  method  based  on  scaled  orthographic  projection  (also  known  as  “weak  per¬ 
spective”),  which  models  the  scaling  effect  of  perspective  projection  but  not  the  position 
effect,  in  order  to  demonstrate  the  importance  of  modelling  the  position  effect  for  objects  at 
close  range.  (See  Appendix  I.)  Our  results  show  that  the  paraperspective  factorization 
method  is  a  vast  improvement  over  the  orthographic  method,  and  underscore  the  importance 
of  modelling  both  the  scaling  and  position  effects. 

5.1.  Data  Generation 

The  synthetic  feature  point  sequences  used  for  comparison  were  created  by  moving  a  known 
“object”  -  a  set  of  3D  points  -  through  a  known  motion  sequence.  We  tested  three  different 
object  shapes,  each  containing  approximately  60  points.  Each  test  run  consisted  of  60  image 
frames  of  an  object  rotating  through  a  total  of  30  degrees  each  of  roll,  pitch,  and  yaw.  The 
“object  depth”  -  the  distance  from  the  camera’s  focal  point  to  the  front  of  the  object  -  in  the 
first  frame  was  varied  from  3  to  60  times  the  object  size.  In  each  sequence,  the  object  trans¬ 
lated  across  the  field  of  view  by  a  distance  of  one  object  size  horizontally  and  vertically,  and 
translated  away  from  the  camera  by  half  its  initial  distance  from  the  camera.  For  example, 
when  the  object’s  depth  in  the  first  frame  was  3.0,  its  depth  in  the  last  frame  was  4.5.  Each 
“image”  was  created  by  perspectively  projecting  the  3D  points  onto  the  image  plane,  for 
each  sequence  choosing  the  largest  focal  length  that  would  keep  the  object  in  the  field  of 
view  throughout  the  sequence.  The  coordinates  in  the  image  plane  were  perturbed  by  adding 
gaussian  noise,  to  model  tracking  imprecision.  The  standard  deviation  of  the  noise  was  2 
pixels  (assuming  a  512x512  pixel  image),  which  we  consider  to  be  a  rather  high  noise  level 
from  our  experience  processing  real  image  sequences.  For  each  combination  of  object, 
depth,  and  noise,  we  performed  three  tests,  using  different  random  noise  each  time. 

5.2.  Error  Measurement 

We  ran  each  of  the  three  factorization  methods  on  each  synthetic  sequence  and  measured  the 
rotation  error,  shape  error,  X-Y  offset  error,  and  Z  offset  (depth)  error.  The  rotation  error  is 
the  root-mean-square  (RMS)  of  the  size  in  radians  of  the  angle  by  which  the  computed  cam¬ 
era  coordinate  frame  must  be  rotated  about  some  axis  to  produce  the  known  camera  orienta¬ 
tion.  The  shape  error  is  the  RMS  error  between  the  known  and  computed  3D  point 
coordinates.  Since  the  shape  and  translations  are  only  determined  up  to  scaling  factor,  we 
first  scaled  the  computed  shape  by  the  factor  which  minimizes  this  RMS  error.  The  term 
“offset”  refers  to  the  translational  component  of  the  motion  as  measured  in  the  camera’s 
coordinate  frame  rather  than  in  world  coordinates;  the  X  offset  is  If  if,  the  Y  offset  is  \f  ]f, 
and  the  Z  offset  is  ty  k/.  The  X-Y  offset  error  and  Z  offset  error  are  the  RMS  error  between 
the  known  and  computed  offset;  like  the  shape  error,  we  first  scaled  the  computed  offset  by 
the  scale  factor  that  minimized  the  RMS  error.  Note  that  the  orthographic  factorization 


page  18 


method  supplies  no  estimation  of  translation  along  the  camera’s  optical  axis,  so  the  Z  offset 
error  cannot  be  computed  for  that  method. 


5.3.  Discussion  of  Results 


Depth  in  Isl  Frame 


Figure  5.  Methods  compared  for  a  typical  case 

noise  standard  deviation  =  2  pixels 


Depth  in  1st  Frame 


Figure  5  shows  the  average  errors  in  the  solutions  computed  by  the  various  methods,  as  a 
functions  of  object  depth  in  the  first  frame.  We  see  that  the  paraperspective  method  performs 
significantly  better  than  the  orthographic  factorization  method  regardless  of  depth,  because 
orthography  cannot  model  the  scaling  effect,  which  occurs  due  to  the  motion  along  the  cam¬ 
era’s  optical  axis.  The  figure  also  shows  that  the  paraperspective  method  performs  substan¬ 
tially  better  than  scaled  orthographic  method  at  close  range,  while  the  errors  from  the  two 
methods  are  nearly  the  same  when  the  object  is  distant.  This  confirms  the  importance  of 
modelling  the  position  effect  when  objects  are  near  the  camera. 
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In  other  experiments  in  which  the  object  was  centered  in  the  image  and  there  was  no  transla¬ 
tion  across  the  field  of  view,  the  paraperspective  method  and  the  scaled  orthographic  method 
p)erformed  equally  well,  as  we  would  expect  since  such  image  sequences  contain  no  position 
effects.  Similarly,  we  found  that  when  the  object  remained  centered  in  the  image  and  there 
was  no  depth  translation,  the  orthographic  factorization  method  performed  very  well,  and 
the  paraperspective  factorization  method  provided  no  significant  improvement  since  such 
sequences  contain  neither  scaling  effects  nor  position  effects. 

To  examine  the  impact  of  those  effects  of  perspective  projection  which  are  not  modelled  by 
paraperspective  projection,  we  implemented  an  iterative  method  which  refines  the  results  of 
the  paraperspective  method  using  a  perspective  projection  model.  (See  Appendix  II.)  Com¬ 
puting  this  refined  solution  required  more  than  ten  times  as  long  as  computing  the  initial 
paraperspective  results.  We  tested  the  method  on  the  same  sequences  that  were  tested  to  pro¬ 
duce  Figure  5,  and  found  the  resulting  shape  and  motion  errors  to  be  nearly  invariant  to 
depth.  The  perspective  refinement  method  only  minimally  improved  the  motion  results  over 
the  paraperspective  solution.  However,  the  shape  results  were  significantly  improved  for 
those  cases  in  which  the  depth  was  less  than  30.  This  implies  that  unmodelled  perspective 
distortion  in  the  images  effects  primarily  the  shape  portion  of  the  paraperspective  factoriza¬ 
tion  method’s  solution,  and  that  the  effects  are  significant  only  when  the  object  is  within  a 
certain  distance  of  the  camera. 


5.4.  Analysis  of  Paraperspective  Method  using  Synthetic  Data 

Now  that  we  have  shown  the  advantages  of  the  paraperspective  factorization  method  over 
the  previous  method,  we  further  analyze  the  performance  of  the  paraperspective  method  to 
determine  its  behavior  at  various  depths  and  its  robustness  with  respect  to  noise.  The  syn¬ 
thetic  sequences  used  in  these  experiments  were  created  in  the  same  manner  as  in  the  previ¬ 
ous  section,  except  that  the  standard  deviation  of  the  noise  was  varied  from  0  to  4.0  pixels. 

In  Figure  6,  we  see  that  at  high  depth  values,  the  error  in  the  solution  is  roughly  proportional 
to  the  level  of  noise  in  the  input,  while  at  low  depths  the  error  is  inversely  related  to  the 
depth.  This  occurs  because  at  low  depths,  perspective  distortion  of  the  object’s  shape  is  the 
primary  source  of  error  in  the  computed  results.  At  higher  depths,  perspective  distortion  of 
the  object’s  shape  is  negligible,  and  noise  becomes  the  dominant  cause  of  error  in  the 
results.  For  example,  at  a  noise  level  of  1  pixel,  the  rotation  and  XY-offset  errors  are  nearly 
invariant  to  the  depth  once  the  object  is  farther  from  the  camera  than  10  times  the  object 
size.  The  shape  results,  however,  appear  sensitive  to  perspective  distortion  even  at  depths  of 
30  or  60  times  the  object  size. 


Rotation  Error  x  lOe-3 
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Figure  6.  Paraperspective  shape  and  motion 
recovery  by  noise  level 


6.  Accommodating  Occlusion  and 
Uncertain  IVacking  Data 
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So  far,  we  have  assumed  that  all  of  the  entries  of  the  measurement  matrix  are  known  and  are 
equally  reliable.  In  real  image  sequences,  this  is  not  the  case.  Some  feature  points  are  not 
tracked  throughout  the  entire  sequence  because  they  leave  the  field  of  view,  become 
occluded,  or  change  their  appearance  significantly  enough  that  the  feature  tracker  can  no 
longer  track  them.  As  the  object  moves,  new  feature  points  can  be  detected  on  parts  of  the 
object  which  were  not  visible  in  the  initial  image.  Thus  the  measurement  matrix  w  is  not 
entirely  filled;  there  are  some  pairs  (/,  p)  for  which  was  not  observed. 

Not  all  observed  position  measurements  are  known  with  the  same  confidence.  Some  feature 
windows  contain  sharp,  trackable  features,  enabling  exact  localization,  while  others  may 
have  significantly  less  texture  information.  Some  matchings  are  very  exact,  while  others  are 
less  exact  due  to  unmodelled  change  in  the  appearance  of  a  feature.  Previously,  using  some 
arbitrary  criteria,  each  measurement  was  either  accepted  or  rejected  as  too  unreliable,  and 
then  all  accepted  observations  were  treated  equally  throughout  the  method. 

We  address  both  of  these  issues  by  assigning  a  confidence  value  to  each  measurement,  and 
modifying  our  method  to  weight  each  measurement  by  its  corresponding  confidence  value. 
If  a  feature  is  not  observed  in  some  frames,  the  confidence  values  corresponding  to  those 
measurements  are  set  to  zero. 

6.1.  Confidence- Weighted  Formulation  and  Solution  Method 

We  can  view  the  decomposition  step  of  Section  4.2.  as  a  way,  given  the  measurement  matrix 
w,  to  compute  an  Aif,  S,  and  T  that  satisfy  the  equation 

W  =  M5  +  r[|  ...  i]  (48) 

There,  our  first  step  was  to  compute  T  directly  from  w.  We  then  used  the  SVD  to  factor  the 
matrix  w*  =  w-  r[i  ...  j]  into  the  product  of  M  and  S.  In  fact,  using  the  SVD  to  perform 
this  factorization  produced  an  M  and  5  that,  given  tv  and  the  T  just  computed  from  tv,  min¬ 
imized  the  error 


2F  P  2 

^0  ~  S  S  (^rp~  ^^r\S\p  +  Mr2^2p  +  MriS2p+T^))  (49) 

r  =  1/7  =  I 

In  our  new  confidence-weighted  formulation,  we  associate  each  element  of  the  measure¬ 
ment  matrix  tv^^  with  a  confidence  value  .  We  incorporate  these  confidences  into  the 
factorization  method  by  reformulating  the  decomposition  step  as  a  weighted  least  squares 
problem,  in  which  we  seek  the  A/,  S,  and  T  which  minimize  the  error 
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2F  p  2 

£=  I  ly^rp(^rp-(‘^rlSlp  +  Mr2S2p  +  Mrihp+T^))  (50) 

r  =  Ip  =  ] 

Once  we  have  solved  this  minimization  problem  for  M,  S,  and  T,  we  can  proceed  with  the 
normalization  step  and  the  rest  of  the  shape  and  motion  recovery  process  precisely  as 
before. 

There  is  sufficient  mathematical  constraint  to  compute  M,  S,  and  T,  provided  the  number  of 
known  elements  of  w  exceeds  the  total  number  of  variables  iSF  +  3P).  The  minimization  of 
equation  (50)  is  a  nonlinear  least  squares  problem  because  each  term  contains  products  of 
the  variables  Mrj  and  Sjp .  However,  it  is  separable;  the  set  of  variables  can  be  partitioned 
into  two  subsets,  the  motion  variables  and  the  shape  variables,  so  that  the  problem  becomes 
a  simple  linear  least  squares  problem  with  respect  to  one  subset  when  the  other  is  known. 
We  use  a  variant  of  an  algorithm  suggested  by  Ruhe  and  Wedin  [8]  for  solving  such  prob¬ 
lems  -  one  that  is  equivalent  to  alternately  refining  the  two  sets  of  variables.  In  other  words, 
we  hold  S  fixed  at  some  value  and  solve  for  the  M  and  T  which  minimize  e.  Then  holding 
M  and  T  fixed,  we  solve  for  the  S  which  minimizes  e .  Each  step  of  the  iteration  is  simple 
and  fast  since,  as  we  will  show  shortly,  it  is  a  series  of  linear  least  squares  problems.  We 
repeat  the  process  until  a  step  of  the  iteration  produces  no  significant  reduction  in  the  error 
e. 

To  compute  M  and  T  for  a  given  S,  we  first  rewrite  the  minimization  of  equation  (50)  as 

2F  P  ...  2 

e=  X  ’  where  ^  ‘f^rp^^rp~  +  •  (51) 

r  =  I  p  =  I 

For  a  fixed  5,  the  total  error  e  can  be  minimized  by  independently  minimizing  each  error  e^, 
since  no  variable  appears  in  more  than  one  equation.  Each  describes  a  weighted  linear 
least  squares  problem  in  the  variables  A/ri  >  Mrz,  Mri,  and  T^.  For  every  row  r,  the  four 
variables  are  computed  by  finding  the  least  squares  solution  to  the  overconstrained  linear 
system  of  4  variables  and  P  equations 

r.  .  .  1  Mh  r  n 

ShYh  531Yh  Y,,  W^,Y,, 

"'■2  =  _  (52) 

Sipy^phpy.phpyrpyrpj  rrpyrP 

Similarly,  for  a  fixed  M  and  T,  each  column  of  5  can  be  computed  independently  of  the 
other  columns,  by  finding  the  least  squares  solution  to  the  linear  system  of  3  variables  and 
2F  equations 
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W,,Y„  M,3Y,p 


Mzf.  lYi/T  p  M2F.l'i2F.p  ^2f.3Y2f 


^Ip 

^2p 

L^3p, 


^^2F.p~  ^2F^hF.p_ 


(53) 


As  in  any  iterative  method,  an  initial  value  is  needed  to  begin  the  process.  Our  initial  exper¬ 
iments  showed,  however,  that  when  the  confidence  matrix  contained  few  zero  values,  the 
method  consistently  converged  to  the  correct  solution  in  a  small  number  of  iterations,  even 
beginning  with  a  random  initial  value.  For  example,  in  the  special  case  of  y,.  =  '  0  (which 
corresponds  to  the  case  in  which  all  features  are  tracked  throughout  the  entire  sequence  and 
are  known  with  equal  confidence),  the  method  always  converged  in  5  or  fewer  iterations, 
requiring  even  less  time  than  computing  the  singular  value  decomposition  of  w;  when  the 
Y^^  were  randomly  given  values  ranging  from  1  to  10,  the  method  generally  converged  in  10 
or  fewer  iterations;  and  with  a  y  whose  fill  fraction  (fraction  of  non-zero  entries)  was  0.8, 
the  method  converged  in  20  or  fewer  iterations.  However,  when  the  fill  fraction  decreased  to 
0.6,  the  method  sometimes  failed  to  converge  even  after  100  iterations. 


In  order  to  apply  the  method  to  sequences  with  lower  fill  fractions,  it  is  critical  that  we 
obtain  a  reasonable  initial  value  before  proceeding  with  the  iteration.  We  developed  an 
approach  analogous  to  the  propagation  method  described  in  [11].  We  first  find  a  subset  of 
rows  and  columns  of  W  for  which  all  of  the  confidences  are  non-zero,  and  solve  for  the  cor¬ 
responding  rows  of  M  and  T  and  columns  of  S  by  running  the  iterative  method  starting  with 
a  random  initial  value.  As  indicated  above,  this  converges  quickly,  producing  estimated  val¬ 
ues  for  this  subset  of  M,  S,  and  T.  We  can  solve  for  one  additional  row  of  M  and  T  by  solv¬ 
ing  the  linear  least  squares  problem  of  equation  (52)  using  only  the  known  columns  of  5. 
We  can  solve  for  one  additional  column  of  5  by  solving  the  linear  least  squares  problem  of 
equation  (53)  using  only  the  known  rows  of  M  and  T.  We  continue  solving  for  additional 
rows  of  M  and  T  or  columns  of  S  until  M,  s,  and  T  are  completely  known.  Using  this  as  the 
initial  value  allows  the  iterative  method  to  converge  in  far  fewer  iterations. 


6.2.  Analysis  of  Weighted  Factorization  using  Synthetic  Data 

We  tested  the  confidence-weighted  paraperspective  factorization  method  with  artificially 
generated  measurement  matrices  and  confidence  matrices  whose  fill  fraction  (fraction  of 
non-zero  entries)  was  varied  from  1.0  down  to  0.2*.  Figure  7  shows  how  the  performance 
degrades  as  the  noise  level  increases  and  fill  fraction  decreases.  The  synthetic  sequences 
were  of  a  single  object  consisting  of  99  points,  initially  located  60  times  the  object  size  from 
the  camera.  Each  data  point  represents  the  average  solution  error  over  5  runs,  using  a  differ¬ 
ent  seed  for  the  random  noise.  The  motion,  method  of  generating  the  sequences,  and  error 
measures  are  as  described  in  Section  5. 

Errors  in  the  recovered  motion  only  increase  slightly  as  the  fill  fraction  decreases  from  1 .0 


1 .  Creating  a  confidence  matrix  that  has  a  given  fill  fraction  and  a  fairly  realistic  fill  pattern  (the  arrangement  of  zero  and 
non-zero  entries  in  the  matrix)  is  a  non-trivial  task.  We  first  create  the  synthetic  feature  tracks  resulting  from  the  given 
object  ana  motion,  by  detecting  when  points  become  occluded  by  some  surface  of  the  object.  These  feature  tracks  are  then 
extended  or  shortened  until  the  desired  fill  fraction  is  achieved. 
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Fill  Fraction  Fill  Fraction 


Fill  Fraction  Fill  Fraction 

Figure  7.  Confidence- Weighted  Shape  and 
Motion  Recovery  by  Noise  Level,  Fill  Fraction 

to  0.5.  At  a  very  low  noise  level,  such  as  0.1  pixels,  this  behavior  continues  down  to  a  fill 
fraction  of  0.3.  When  the  fill  fraction  is  decreased  below  this  range,  however,  the  error  in  the 
recovered  motion  increases  very  sharply.  The  shape  results  appear  more  sensitive  to 
decreased  fill  fractions;  as  the  fill  fraction  drops  to  0.8  or  lower,  the  shape  error  increases 
sharply,  and  then  increases  dramatically  when  the  fill  fraction  reaches  0.6  or  0.5.  While  the 
system  of  equations  defined  by  equation  (50)  is  still  overconstrained  at  these  lower  fill  frac- 
tions,  apparently  there  is  insufficient  redundancy  to  overcome  the  effects  of  the  noise  in  the 
data. 
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7.  Shape  and  Motion  Recovery  from 
Real  Image  Sequences 

We  tested  the  paraperspective  factorization  method  on  two  real  image  sequences  -  a  labora¬ 
tory  experiment  in  which  a  small  model  building  was  imaged,  and  an  aerial  sequence  taken 
from  a  low-altitude  plane  using  a  hand-held  video  camera.  Both  sequences  contain  signifi¬ 
cant  perspective  effects,  due  to  translations  along  the  optical  axis  and  across  the  field  of 
view.  We  implemented  a  system  to  automatically  identify  and  track  features,  based  on 
[11]  and  [3].  This  tracker  computes  the  position  of  a  square  feature  window  by  minimizing 
the  sum  of  the  squares  of  the  intensity  difference  over  the  feature  window  from  one  image  to 
the  next. 

7.1.  Hotel  Model  Sequence 

A  hotel  model  was  imaged  by  a  camera  mounted  on  a  computer-controlled  movable  plat¬ 
form.  The  camera  motion  included  substantial  translation  away  from  the  camera  and  across 
the  field  of  view  (see  Figure  8).  The  feature  tracker  automatically  identified  and  tracked  197 
points  throughout  the  sequence  of  181  images. 


Frame  121  Frame  151 

Figure  8.  Hotel  Model  Image  Sequence 


Both  the  paraperspective  factorization  method  and  the  orthographic  factorization  method 
were  tested  with  this  sequence.  The  shape  recovered  by  the  orthographic  factorization 
method  was  rather  deformed  (see  Figure  9)  and  the  recovered  motion  incorrect,  because  the 
method  could  not  account  for  the  scaling  and  position  effects  which  are  prominent  in  the 
sequence.  The  paraperspective  factorization  method,  however,  models  these  effects  of  per¬ 
spective  projection,  and  therefore  produced  an  accurate  shape  and  accurate  motion. 

Several  features  in  the  sequence  were  poorly  tracked,  and  as  a  result  their  recovered  3D 
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Orthographic  Solution  Parapers|^tive  Solution  * 

Figure  9.  Comparison  of  Orthographic  and 
Paraperspective  Shape  Results 

positions  were  incorrect.  While  they  did  not  disrupt  the  overall  solution  greatly,  we  found 
that  we  could  achieve  improved  results  by  automatically  removing  these  features  in  the  fol¬ 
lowing  manner.  Using  the  recovered  shape  and  motion,  we  computed  the  reconstructed  mea¬ 
surement  matrix  ,  and  then  eliminated  from  w  those  features  for  which  the  average 

error  between  the  elements  of  w  and  was  more  than  twice  the  average  such  error. 

We  then  ran  the  shape  and  motion  recovery  again,  using  only  the  remaining  179  features. 

Eliminating  the  poorly  tracked  features  decreased  errors  in  the  recovered  rotation  about  the 
camera’s  x-axis  in  each  frame  by  an  average  of  0.5  degrees,  while  the  errors  in  the  other 
rotation  parameters  were  also  slightly  improved.  The  final  rotation  values  are  shown  in  Fig¬ 
ure  10,  along  with  the  values  we  measured  using  the  camera  platform.  The  computed  rota¬ 
tion  about  the  camera  x-axis,  y-axis,  and  z-axis  was  always  within  0.29  degrees,  1.78 
degrees,  and  0.45  degrees  of  the  measured  rotation,  respectively. 

7.2.  Aerial  Image  Sequence 

An  aerial  image  sequence  was  taken  from  a  small  airplane  overflying  a  suburban  Pittsburgh 
residential  area  adjacent  to  a  steep,  snowy  valley,  using  a  small  hand-held  video  camera. 

The  plane  altered  its  altitude  during  the  sequence  and  also  varied  its  roll,  pitch,  and  yaw 
slightly.  Several  images  from  the  sequence  are  shown  in  Figure  11. 

Due  to  the  bumpy  motion  of  the  plane  and  the  instability  of  the  hand-held  camera,  features 
often  moved  by  as  much  as  30  pixels  from  one  image  to  the  next.  The  original  feature 
tracker  could  not  track  motions  of  more  than  approximately  3  pixels,  so  we  implemented  a 
coarse-to-fine  tracker.  The  tracker  first  estimated  the  translation  using  low  resolution 
images,  and  then  refined  that  value  using  the  same  methods  as  the  initial  tracker. 

The  sequence  covered  a  long  sweep  of  terrain,  so  none  of  the  features  were  visible  through-  v. 

out  the  entire  sequence.  As  some  features  left  the  field  of  view,  new  features  were  automati¬ 
cally  detected  and  added  to  the  set  of  features  being  tracked.  A  vertical  bar  in  the  fill  pattern 
(shown  in  Figure  11)  indicates  the  range  of  frames  through  which  a  feature  was  successfully 
tracked.  Each  observed  data  measurement  was  assigned  a  confidence  value  based  on  the 
gradient  of  the  feature  and  the  tracking  residue.  A  total  of  1026  points  were  tracked  in  the 
108  image  sequence,  with  each  point  being  visible  for  an  average  of  30  frames  of  the 
sequence. 
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Figure  10.  Hotel  Model  Rotation  Results 


The  confidence-weighted  paraperspective  factorization  method  was  used  to  recover  the 
shape  of  the  terrain  and  the  motion  of  the  airplane.  Two  views  of  the  reconstructed  terrain 
map  are  shown  in  Figure  12.  While  no  ground-truth  was  available  for  the  shape  or  the 


Two  views  of  reconsuucted  terrain 


•j  Figure  12.  Reconstructed  Terrain 

motion,  we  observed  that  the  terrain  was  qualitatively  correct,  capturing  the  flat  residential 
•  area  and  the  steep  hillside  as  well,  and  that  the  recovered  positions  of  features  on  buildings 

were  elevated  from  the  surrounding  terrain. 
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8.  Conclusions 

The  principle  that  the  measurement  matrix  has  rank  3,  as  put  forth  by  Tomasi  and  Kanade  in 
[10],  was  dependent  on  the  use  of  an  orthographic  projection  model.  We  have  shown  in  this 
paper  that  this  important  result  also  holds  for  the  case  of  paraperspective  projection,  which 
closely  approximates  perspective  projection.  We  have  devised  a  paraperspective  factoriza¬ 
tion  method  based  on  this  model,  which  uses  different  metric  constraints  and  motion  recov¬ 
ery  techniques,  but  retains  many  of  the  features  of  the  original  factorization  method. 

In  image  sequences  in  which  the  object  being  viewed  translates  significantly  toward  or  away 
from  the  camera  or  across  the  camera’s  field  of  view,  the  paraperspective  factorization 
method  performs  significantly  better  than  the  orthographic  method.  The  paraperspective  fac¬ 
torization  method  also  computes  the  distance  from  the  camera  to  the  object  in  each  image 
and  can  accommodate  missing  or  uncertain  tracking  data,  which  enables  its  use  in  a  variety 
of  applications. 

The  C  implementation  of  the  paraperspective  factorization  method  required  about  20-24 
seconds  to  solve  a  system  of  60  frames  and  60  points  on  a  Sun  4/65,  with  most  of  this  time 
spent  computing  the  singular  value  decomposition  of  the  measurement  matrix.  Running 
times  for  the  confidence-weighted  method  were  comparable,  but  varied  depending  on  the 
number  of  iterations  required  for  the  method  to  converge.  While  this  is  not  sufficient  for 
real-time  use,  we  hope  to  develop  a  faster  implementation. 

The  confidence-weighted  factorization  method  performs  well  when  the  fill  fraction  of  the 
confidence  matrix  is  high  or  when  the  noise  level  is  very  low.  In  future  work  we  hope  to 
determine  more  precisely  in  what  circumstances  this  method  can  be  expected  to  perform 
well,  and  to  investigate  ways  to  extend  its  range  so  that  the  method  can  be  applied  to  longer 
sequences  in  which  the  fill  fraction  is  much  lower. 
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Appendix  I.  The  Scaled  Orthographic 
Factorization  Method 

Scaled  orthographic  projection,  also  known  as  “weak  perspective”  [4],  is  a  closer  approxi- 
^  mation  to  perspective  projection  than  orthographic  projection,  yet  not  as  accurate  as  parap- 

erspective  projection.  It  models  the  scaling  effect  of  perspective  projection,  but  not  the 
position  effect.  The  scaled  orthographic  factorization  method  can  be  used  when  the  object 
remains  centered  in  the  image,  or  when  the  distance  to  the  object  is  lai^e  relative  to  the  size 
of  the  object. 

I.l.  Scaled  Orthographic  Projection 

Under  scaled  orthographic  projection,  object  points  are  orthographical ly  projected  onto  a 
hypothetical  image  plane  parallel  to  the  actual  image  plane  but  passing  through  the  object’s 
center  of  mass  c.  This  image  is  then  projected  onto  the  image  plane  using  perspective  pro¬ 
jection  (see  Figure  13). 


Hypothetical 


Figure  13 

Scaled  Orthographic  Projection  in  two  dimensions 
Dotted  lines  indicate  true  perspective  projection 
indicate  parallel  lines. 

Because  the  perspectively  projected  points  all  lie  on  a  plane  parallel  to  the  image  plane,  they 
all  lie  at  the  same  depth 

2y  =  (c-tj)  k^.  (54) 

Thus  the  scaled  orthographic  projection  equations  are  very  similar  to  the  orthographic  pro¬ 
jection  equations,  except  that  the  image  plane  coordinates  are  scaled  by  the  ratio  of  the  focal 
length  to  the  depth 
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/  (55) 

V 

To  simplify  the  equations  we  assume  unit  focal  length,  /  =  i .  The  world  origin  is  arbitrary, 
so  we  fix  it  at  the  object’s  center  of  mass,  so  that  c  =  0,  and  rewrite  the  above  equations  as 


where 


=  “/ 

(56) 

-yk^ 

(57) 

II 

1 

II 

(58) 

m  _  '/ 

m,  =  — 

•.^1 

II 

(59) 

i 


1.2.  Decomposition 

Because  equation  (56)  is  identical  to  equation  (2),  the  measurement  matrix  w  can  still  be 
written  as  )V  =  A/5  +  T"  Just  as  in  orthographic  and  paraperspective  cases.  We  still  compute  jy 
and  >y  immediately  from  the  image  data  using  equation  (25),  and  use  singular  value  decom¬ 
position  to  factor  the  registered  measurement  matrix  w‘  into  the  product  of  M  and  S. 

1.3.  Normalization 

Again,  the  decomposition  is  not  unique  and  we  must  determine  the  3x3  matrix  A  which 
produces  the  actual  motion  matrix  M  =  MA  and  the  shape  matrix  S  =  /f's.  From  equation 
(59), 


I*"/  =  \  I"/  =  4 

Zf  Zf 

We  do  not  know  the  value  of  the  depth  Zj,  so  we  cannot  impose  individual  constraints  on  ny 
and  as  we  did  in  the  orthographic  case.  Instead,  we  combine  the  two  equations  as  we  did 
in  the  paraperspective  case,  to  impose  the  constraint 

I™/  =  I"/-  (61) 

Because  and  are  just  scalar  multiples  of  y  and  j^,  we  can  still  use  the  constraint  that 

iny^n^  =  0.  (62) 

As  in  the  paraperspective  case,  equations  (61)  and  (62)  are  homogeneous  constraints,  which 
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could  be  trivially  satisfied  by  the  solution  A#  =  0,  so  to  avoid  this  solution  we  add  the  con¬ 
straint  that 


jin,|  =  1 .  (63) 

Equations  (61),  (62),  and  (63)  are  the  scaled  orthographic  version  of  the  metric  constraints. 
We  can  compute  the  3x3  matrix  a  which  best  satisfies  them  very  easily,  because  the  con¬ 
straints  are  linear  in  the  6  unique  elements  of  the  symmetric  3x3  matrix  Q  =  fiJ A . 

1.4.  Shape  and  Motion  Recovery 

Once  the  matrix  a  has  been  found,  the  shape  is  computed  as  S  =  a"'s.  We  compute  the 
motion  parameters  as 


(64) 


Unlike  the  orthographic  case,  we  can  now  compute  zj,  the  component  of  translation  along 
the  camera’s  optical  axis,  from  equation  (60). 


4 
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Appendix  11.  Perspective  Method 

This  section  presents  an  iterative  method  used  to  recover  the  shape  and  motion  using  a  per¬ 
spective  projection  model.  Although  our  algorithm  was  developed  independently  and  han¬ 
dles  the  full  three  dimensional  case,  this  method  is  quite  similar  to  a  two  dimensional 
algorithm  developed  by  Taylor,  Kriegman,  and  Anandan  as  reported  in  [9].  * 

II.l.  Perspective  Projection  y 

In  the  perspective  projection  model,  sometimes  referred  to  as  the  pinhole  camera  model, 
object  points  are  projected  directly  towards  the  focal  point  of  the  camera.  An  object  point’s 
image  coordinates  are  determined  by  the  position  at  which  the  line  connecting  the  object 
point  with  the  camera’s  focal  point  intersects  the  image  plane,  as  illustrated  in  Figure  14. 


Simple  geometry  using  similar  triangles  produces  the  perspective  projection  equations 


fP  kf 


(65) 


We  assume  unit  focal  length,  and  rewrite  the  equations  in  the  form 


Ur  =  - - - - - 

fP  kfS,  +  Zf 

j/S/7+>y 

V.  =  — i - i 

fP  k,s 


(66) 


where 
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■V"  '/■*/  j/  V  V 

II.2.  Iterative  Minimization  Method 


The  above  equations  are  non-linear  in  the  shape  and  motion  variables.  There  is  no  apparent 
way  to  combine  the  equations  for  all  points  and  frames  into  a  single  matrix  equation,  to  sep¬ 
arate  the  shape  from  the  motion,  or  to  compute  the  translational  components  directly  from 
the  image  measurements,  as  we  did  in  the  orthographic  and  paraperspective  cases.  Instead 
we  formulate  the  problem  as  an  overconstrained  non-linear  least  squares  problem  in  the 
motion  and  shape  variables,  in  which  we  seek  to  minimize  the  error 


F  p 

I  I  { 

/=  i/»=  I 


Vp  k 


J/Sp  +  >/ 


} . 


(68) 


In  the  above  formulation,  there  appear  to  be  12  motion  variables  loi  each  frame,  since  each 
image  frame  is  defined  by  three  orientation  vectors  and  a  translation  vector.  In  reality,  since 
1^,  j^,  and  ky  must  be  orthonormal  vectors,  they  can  be  written  as  functions  of  only  three 
independent  rotational  parameters  p^,  and  y. 


coso^osP^  (coso^inP^in-j^-  sinay:osiy) 
sino^osp^  ( sina^inP^inYy+  coso^osy^) 
-sinP^  cosp^iny^ 


(cosa^inPyCosy^+  sina^iny^) 
(sino^inp^osy^-  coso^iny^ 
cosP^cosy^ 


(69) 


Therefore  we  have  six  motion  parameters,  x^,  jy,  z^,  o^,  p^,  and  for  each  frame,  and  three 
shape  parameters,  for  each  point.  Equation  (66)  defines  an  overconstrained 

set  of  2FP  equations  in  Hiese  6F+  'iP  variables,  and  we  carry  out  the  minimization  of  equa¬ 
tion  (68)  with  \p  ip  and  defined  by  equation  (69)  as  functions  of  o^,  p^,  and  y^. 


We  could  in  theory  apply  any  one  of  a  number  of  non-linear  equation  solution  techniques  to 
this  problem.  Such  methods  begin  with  a  set  of  initial  variable  values,  and  iteratively  refine 
those  values  to  reduce  the  error.  We  know  the  mathematical  form  of  the  equations,  so  we  can 
use  derivative  information  to  guide  our  numerical  search.  However,  general  non-linear  least 
square  techniques  would  not  take  full  advantage  of  the  structure  of  our  equations.  The  Lev- 
enberg-Marquardt  technique  [7]  would  require  the  creation  and  inversion  of  a 
(6F+  3/*)  X  (6F+  3P)  matrix  at  each  step  of  the  iteration.  This  is  unacceptably  slow,  since 
we  often  use  hundreds  of  points  and  frames. 


Our  method  takes  advantage  of  the  particular  structure  of  the  equations  by  separately  refin¬ 
ing  the  shape  and  motion  parameters.  We  hold  the  shape  constant  and  solve  for  the  motion 
parameters  which  minimize  the  error.  We  then  hold  the  motion  constant,  and  solve  for  the 
shape  parameters  which  minimize  the  error.  We  repeat  this  process  until  an  iteration  pro¬ 
duces  no  significant  reduction  in  the  total  error  e. 


While  holding  the  shape  constant,  the  minimization  with  respect  to  the  motion  variables  can 
be  performed  independently  for  each  frame.  This  minimization  requires  solving  an  overcon- 
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strained  system  of  6  variables  in  P  equations.  Likewise  while  holding  the  motion  constant, 
we  can  solve  for  the  shape  separately  for  each  point  by  solving  a  system  of  2F  equations  in 
3  variables.  This  not  only  reduces  the  problem  to  manageable  complexity,  but  as  pointed  out 
in  [9],  it  lends  itself  well  to  parallel  implementation. 

We  perform  the  individual  minimizations,  fitting  6  motion  variables  to  P  equations  or  fitting 
3  shape  variables  to  IF  equations,  using  the  Levenberg-Marquardt  method  [7],  a  method 
which  uses  steepest  descent  when  far  from  the  minimum  and  varies  continuously  towards 
the  inverse-Hessian  method  as  the  minimum  is  approached.  Each  step  of  the  iteration 
requires  P  inversions  of  6  x  6  matrices  and  IF  inversions  of  3  x  3  matrices. 

We  do  not  actually  need  to  vary  all  6f  +  3P  variables,  since  the  solution  is  only  determined 
up  to  a  scaling  factor,  the  world  origin  is  arbitrary,  and  the  world  coordinate  orientation  is 
arbitrary.  We  could  choose  to  arbitrarily  fix  each  of  the  first  frame’s  rotation  variables  at  zero 
degrees,  and  similarly  fix  some  shape  or  translation  parameters.  However,  experimentally 
we  found  that  the  algorithm  converges  significantly  faster  when  the  shape  and  motion 
parameters  are  all  allowed  to  vary.  Once  the  algorithm  has  converged  to  a  solution,  we 
adjust  the  final  shape  and  translation  to  place  the  origin  at  the  object’s  center  of  mass,  scale 
the  solution  so  that  the  depth  in  the  first  frame  is  1.0,  and  rotate  the  solution  so  that 
ii  =  [l  0  and  ji  =  [o  1  or  equivalently,  so  that  a,  =  p,  =  r,  =  0. 

One  problem  with  any  iterative  method  is  that  the  final  result  can  be  highly  dependent  on  the 
initial  values.  Taylor,  Kriegman,  and  Anandan  [9]  require  some  basic  odometry  measure¬ 
ments  as  might  be  produced  by  a  navigation  system  to  use  as  initial  values  for  their  motion 
parameters,  and  use  the  2D  shape  of  the  object  in  the  first  image  frame,  assuming  constant 
depth,  as  their  initial  shape.  To  avoid  the  requirement  for  odometry  measurements,  which 
will  not  be  available  in  many  situations,  we  use  the  paraperspective  factorization  method  to 
supply  initial  values  to  the  perspective  iteration  method. 


